Overview

Differences from deterministic modeling

A stochastic model incorporates randomness: given the same inputs, it may generate different outputs on different runs.

This is useful when the process you want to model is fundamentally stochastic, or if there is heterogeneity in its dynamics.

Unfortunately, stochastic models can get complicated, and they come with a lot of notation and formalism. Our goal here is to quickly review basic probability theory so that we can use stochastic models productively.


Random variables

A random variable is a variable whose value is random. In this course, random variables will usually have some real-world meaning, like:

  • \(X\) is the value of a fair coin toss
  • \(X\) is the number of new HIV infections diagnosed this week
  • \(X\) is the number of people who had a drug overdose in the last 24 hours
  • \(X\) is the survival time of a subject living with cancer

Probability

Probability \(\Pr(\cdot)\) is a function whose argument is a value a random variable can take on, and which returns a number between 0 and 1. The set of all such values, together with their probabilities, describes the probability distribution.

For discrete random variables, we will write the probability mass function \(\Pr(X=x)\). For continuous random variables, we will use the density, defined below.

We usually describe random variables in terms of their probability distribution. For example, consider the random variable \(X\) defined by \(\Pr(X=0) = 1/2\) and \(\Pr(X=1)=1/2\). What might \(X\) refer to?


Cumulative probability distribution function

All random variables on the real line have a cumulative distribution function (CDF) \[ F(x) = \Pr(X \le x) \] The cumulative distribution function fully characterizes the probability distribution of \(X\).


Probability density

The density of a continuous random variable is \(f(x)\), the derivative of the CDF \(F(x)\) of \(X\), where it exists: \[ f(x) = \lim_{h\to 0} \frac{F(x+h) - F(x)}{h} \]


Conditional probability

\(\Pr(Y=y|X=x)\) is the probability that \(Y=y\), given that \(X=x\). We interpret the statement on the right-hand side of the conditioning sign \(|\) as deterministic. The definition of conditional probability tells us how to compute these quantities:

\[ \Pr(Y=y|X=x) = \frac{\Pr(Y=y,X=x)}{\Pr(X=x)} \]


Expectation

The expected value of a random variable is its “average” value.

\[ E[X] = \sum_x x \Pr(X=x) \]

or

\[ E[X] = \int_{-\infty}^\infty x f(x) dx \]

where \(f(x)\) is the density of \(X\). The expectation is the average of the values \(x\) that \(X\) can take on, weighted by their probabilities.

Can a random variable \(X\) have expectation equal to a value that \(X\) can never take? Yes. If \(X\sim \text{Bernoulli}(1/2)\), then \(\Pr(X=x)=1/2\) for \(x=1\) and \(x=0\). But \(E[X]=1/2\).
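A quick Monte Carlo check of this fact, sketched in Python rather than R (the sample size is arbitrary):

```python
import random

random.seed(1)

# Simulate many Bernoulli(1/2) draws and average them.
n = 100_000
draws = [random.random() < 0.5 for _ in range(n)]  # True/False, i.e. 1/0
mean = sum(draws) / n

print(round(mean, 2))  # close to 0.5, a value X itself never takes
```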


Conditional expectation

Conditional expectations are just expectations with respect to a conditional probability distribution. \[ E[Y|X=x] = \sum_y y\Pr(Y=y|X=x) \] \[ E[Y|X=x] = \int_{-\infty}^\infty y f(y|X=x) dy \]


Law of total probability

\[ \Pr(X=x) = \sum_y \Pr(X=x|Y=y) \Pr(Y=y) \] \[ \Pr(X=x) = \int_{-\infty}^\infty \Pr(X=x|Y=y) f(y) dy \]

Law of total expectation

\[ E[X] = \sum_y E[X|Y=y] \Pr(Y=y) \] \[ E[X] = \int_{-\infty}^\infty E[X|Y=y] f(y) dy \]
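Both laws can be checked numerically. Here is a small Python sketch with a made-up two-stage model (all the probabilities below are arbitrary):

```python
# Hypothetical two-stage model: Y ~ Bernoulli(0.3), and X | Y=y ~ Bernoulli(p_y)
p_y1 = 0.3                       # Pr(Y = 1)
p_x_given = {0: 0.2, 1: 0.6}     # Pr(X = 1 | Y = y)

# Law of total probability: Pr(X=1) = sum_y Pr(X=1|Y=y) Pr(Y=y)
p_x1 = p_x_given[0] * (1 - p_y1) + p_x_given[1] * p_y1
print(round(p_x1, 2))  # 0.32

# Law of total expectation: E[X] = sum_y E[X|Y=y] Pr(Y=y);
# for a Bernoulli X, E[X|Y=y] = Pr(X=1|Y=y), so the two laws coincide
e_x = p_x_given[0] * (1 - p_y1) + p_x_given[1] * p_y1
assert e_x == p_x1
```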


Independence

Two random variables are independent if their joint probability factorizes into the product of their marginal probabilities, \[ \Pr(X=x,Y=y) = \Pr(X=x) \Pr(Y=y) \] or \(f(x,y) = f(x)f(y)\). e.g. independent coin flips, die rolls, etc.

Discrete probability distributions

Bernoulli

This is the classic coin-flipping distribution.

The random variable \(X\) has Bernoulli\((p)\) distribution if \(\Pr(X=1)=p\) and \(\Pr(X=0)=1-p\).


Binomial

A Binomial\((n,p)\) random variable is a sum of \(n\) Bernoulli variables, each independent with probability \(p\). The probability mass function is \[ \Pr(X=k) = \binom{n}{k} p^k (1-p)^{n-k} \] where \(0 \le k \le n\).
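To see the sum-of-Bernoullis story at work, here is an illustrative Python check comparing the pmf formula against simulation (the values \(n=10\), \(p=0.3\), \(k=3\) are arbitrary):

```python
import math
import random

random.seed(2)
n, p = 10, 0.3

# Exact pmf: Pr(X = 3) = C(10,3) p^3 (1-p)^7
exact = math.comb(n, 3) * p**3 * (1 - p) ** (n - 3)

# Monte Carlo: X is a sum of n independent Bernoulli(p) trials
trials = 200_000
hits = sum(1 for _ in range(trials)
           if sum(random.random() < p for _ in range(n)) == 3)
approx = hits / trials

print(round(exact, 3))  # about 0.267
assert abs(exact - approx) < 0.01
```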


Geometric

A Geometric\((p)\) random variable is the number of Bernoulli\((p)\) trials needed to obtain the first success. The probability mass function is

\[ \Pr(X=k) = (1-p)^{k-1} p \] where \(k \ge 1\).


Poisson

A common probability model for count data is \[ \Pr(X = k) = \frac{(\lambda t)^k e^{-\lambda t}}{k!} \] for \(k = 0, 1, 2, \ldots\). This distribution looks weird, but it has a nice story, which we will see when we talk about Poisson processes.

Continuous probability distributions

Uniform

Usually defined on \([0,1]\). \(X \sim \text{Unif}(0,1)\) means the density is \[ f(x) = 1 \] for \(0\le x \le 1\).

Beta

If \(X\) has beta distribution, the density is

\[ f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)} \]

for \(0\le x \le 1\), where

\[ B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} \]


Exponential

If \(X\) has exponential distribution, the density is

\[ f(x) = \lambda e^{-\lambda x} \]

and the CDF is \[ F(x) = 1-e^{-\lambda x} \] for \(x \ge 0\).

Memoryless property: \[ \Pr(X>t+h| X>t) = \frac{\Pr(X>t+h)}{\Pr(X>t)} = \frac{e^{-\lambda(t+h)}}{e^{-\lambda t}} = e^{-\lambda h} = \Pr(X>h) \]
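The algebra above is easy to verify numerically; the rate and times in this Python sketch are arbitrary:

```python
import math

lam, t, h = 0.5, 2.0, 1.0  # illustrative rate and times

def surv(s):
    """Pr(X > s) for X ~ Exponential(lam)."""
    return math.exp(-lam * s)

# Conditional survival equals the unconditional survival of the increment
lhs = surv(t + h) / surv(t)   # Pr(X > t+h | X > t)
rhs = surv(h)                 # Pr(X > h)
print(round(lhs, 6), round(rhs, 6))
assert abs(lhs - rhs) < 1e-12
```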


Normal

The normal\((\mu,\sigma^2)\) density is

\[ f(x) = \frac{1}{\sqrt{2\pi} \sigma} \exp\left[ -\frac{1}{2\sigma^2} (x-\mu)^2 \right] \]

Where does this functional form come from? It’s complicated.


Other probability distributions

There are lots of other “canonical” probability distributions

Discrete: Negative binomial, beta binomial, counting distributions

Continuous: Gamma, Lognormal, Laplace, survival/accelerated failure time distributions

We won’t use these in this class, but it’s important to know that there are lots of options for modeling random quantities. This is partly what the field of statistics is about.

Decision trees

Decision trees as probability models.

A decision tree represents a sequence of decisions, actions, or states. You can think of an individual jumping from left to right along the nodes in the tree. At each node, they follow a link with the given probabilities. This means their path through the tree is stochastic.

Simple decision tree for the treatment of individuals diagnosed with a certain disease. Following diagnosis, each individual gets treatment A or B. Then either they survive or die by some follow-up time, depending on which treatment they got.

Computing with decision trees

Computing “marginal” and conditional probabilities is easy via the law of total probability: \[ \Pr(\text{Alive}|\text{Treatment A}) = 0.7 \] \[ \Pr(\text{Alive}|\text{Treatment B}) = 0.1 \] \[ \Pr(\text{Alive}) = 0.4 \times 0.7 + 0.6 \times 0.1 = 0.34 \]


A note about decision trees

The total probability of being alive is \[ \Pr(\text{Alive}) = 0.9 \times 0.6 + 0.1 \times 0.9 = 0.63\]

Let’s look at the alive people and see which treatment they got: \[ \Pr(\text{Treatment A}|\text{Alive}) = 0.9 \times 0.6 / 0.63 = 0.86 \] \[ \Pr(\text{Treatment B}|\text{Alive}) = 0.1 \times 0.9 / 0.63 = 0.14 \]
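These numbers are easy to reproduce. A Python sketch of the arithmetic, with the probabilities taken from the tree above:

```python
# Probabilities read off the decision tree
p_a, p_b = 0.9, 0.1                  # Pr(Treatment A), Pr(Treatment B)
p_alive_a, p_alive_b = 0.6, 0.9      # Pr(Alive | treatment)

# Law of total probability
p_alive = p_a * p_alive_a + p_b * p_alive_b
print(round(p_alive, 2))   # 0.63

# Bayes' theorem: condition on being alive
p_a_alive = p_a * p_alive_a / p_alive
p_b_alive = p_b * p_alive_b / p_alive
print(round(p_a_alive, 2), round(p_b_alive, 2))   # 0.86 0.14
```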

Most of the alive people got Treatment A. Treatment A must be best! Is that true?


Dynamic stochastic models

Markov models

In this course, we will focus on Markov stochastic models in discrete and continuous time. A Markov model is one whose next event depends only on its current state, and not on its prior history (Bailey 1990; Renshaw 2015).


Discrete-time Markov models

Let \(X(t)\) be a Markov process taking \(n\) different states, in discrete time, \(t=1,2,\ldots\). Define the matrix of one-step transition probabilities from each state \(i\) to each state \(j\),

\[ \mathbf{P} = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nn} \end{pmatrix} \]

Define the transition probability of moving from state \(i\) to state \(j\) in time \(t\) as

\[ P_{ij}(t) = \Pr(X(t) = j | X(0) = i) \]

You can compute transition probabilities by multiplying \(\mathbf{P}\) by itself \(t\) times:

\[ P_{ij}(t) = [\mathbf{P}^t]_{ij} \]
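A minimal Python illustration with a hypothetical two-state chain (the entries of \(\mathbf{P}\) are made up):

```python
# Two-state chain, hypothetical transition matrix P (rows sum to 1)
P = [[0.9, 0.1],
     [0.5, 0.5]]

def matmul(A, B):
    """Multiply two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# P^t by repeated multiplication (t = 3 here)
Pt = P
for _ in range(2):
    Pt = matmul(Pt, P)

# Pr(X(3) = 2 | X(0) = 1) is the entry in row 1, column 2 (0-indexed [0][1])
print(round(Pt[0][1], 4))  # 0.156
assert all(abs(sum(row) - 1) < 1e-12 for row in Pt)  # rows still sum to 1
```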


Continuous-time Markov models

We described deterministic systems in terms of their rates earlier. But a continuous-time stochastic model produces different output each time we run it. How can we describe its rates of change when its path is different every time?

We need to describe the rate of change of its transition probabilities. Consider a continuous-time Markov chain \(X(t)\) on a discrete state space.

Let the transition rate from state \(i\) to state \(j\) be \(\lambda_{ij}\), and define the rate matrix
\[ \mathbf{Q} = \begin{pmatrix} -\lambda_{11} & \lambda_{12} & \cdots & \lambda_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ \lambda_{n1} & \lambda_{n2} & \cdots & -\lambda_{nn} \end{pmatrix} \]

where \(\lambda_{ii} = \sum_{j\neq i} \lambda_{ij}\). The transition probabilities \(P_{ij}(t) = \Pr(X(t)=j|X(0)=i)\) can be computed by:

\[ \mathbf{P}(t) = \exp[\mathbf{Q} t] \]
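As a sketch, the matrix exponential can be approximated by its power series \(\sum_k (\mathbf{Q}t)^k/k!\) and checked against the known closed form for a two-state chain (the rates below are illustrative):

```python
import math

# Two-state chain with rate a for 1 -> 2 and b for 2 -> 1 (illustrative)
a, b, t = 0.4, 0.6, 2.0
Q = [[-a, a],
     [b, -b]]

def expm(Q, t, terms=60):
    """exp(Qt) via the truncated power series sum_k (Qt)^k / k!."""
    Qt = [[q * t for q in row] for row in Q]
    P = [[1.0, 0.0], [0.0, 1.0]]      # running sum, starts at identity
    term = [[1.0, 0.0], [0.0, 1.0]]   # current term (Qt)^k / k!
    for k in range(1, terms):
        term = [[sum(term[i][m] * Qt[m][j] for m in range(2)) / k
                 for j in range(2)] for i in range(2)]
        P = [[P[i][j] + term[i][j] for j in range(2)] for i in range(2)]
    return P

P = expm(Q, t)

# Closed form for a two-state chain: P_12(t) = a/(a+b) * (1 - exp(-(a+b)t))
exact = a / (a + b) * (1 - math.exp(-(a + b) * t))
print(round(P[0][1], 6), round(exact, 6))
assert abs(P[0][1] - exact) < 1e-9
```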

Waiting times to jumps

Consider a continuous-time Markov model at time \(t\), where \(X(t)=i\). The rate of jumping to state \(j\) is \(\lambda_{ij}\). What does this mean?

It means that the jump to \(j\) occurs after a waiting time \(W_{ij}\) with distribution Exponential\((\lambda_{ij})\). If there are several possible destinations \(j\) with \(\lambda_{ij}>0\), then the waiting time to the next jump to any state is the minimum of all these exponential waiting times. Fortunately,

\[ \min\{W_{ij}:\ j\neq i\} \sim \text{Exponential}\left(\sum_{j\neq i} \lambda_{ij}\right) \]

Furthermore, the destination of the next jump is independent of this waiting time.
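A simulation sketch of the minimum-of-exponentials fact (the two rates are arbitrary):

```python
import random

random.seed(3)

# Competing exponential clocks out of state i (hypothetical rates)
rates = {"j": 0.5, "k": 1.5}           # lambda_ij, lambda_ik
total = sum(rates.values())            # 2.0

# Simulate the race many times: the time of the next jump is the min
n = 200_000
waits = []
for _ in range(n):
    times = {s: random.expovariate(r) for s, r in rates.items()}
    waits.append(min(times.values()))

# The minimum should be Exponential(total), so its mean is 1/total
mean_wait = sum(waits) / n
print(round(mean_wait, 2))  # close to 1/2.0 = 0.5
assert abs(mean_wait - 1 / total) < 0.01
```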

Properties of continuous-time Markov models

Several properties of CTMCs are useful:

  1. Markov property
  2. Exponential waiting times to the next jump
  3. Independence of waiting time and destination
  4. Easy simulation

Drawbacks:

  1. Difficult to compute moments
  2. Difficult to compute transition probabilities
  3. Difficult to estimate parameters from discretely observed data
  4. Infinite state spaces require special methods

Poisson process

Now consider a continuous-time Markov chain whose only positive transition rates are to the next integer state, \(\lambda_{i,i+1}=\lambda\). This is called a counting process. Let \(X(t)=i\), so the waiting time to the jump to \(i+1\) is

\[ W_i \sim \text{Exponential}(\lambda) \]

following the last event. Given \(X(0)=0\), what is the distribution of the number of jumps \(X(t)\)?

\[ X(t) \sim \text{Poisson}(\lambda t) \]

So \(E[X(t)] = \lambda t\). This is where the Poisson distribution comes from.
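This connection is easy to check by simulation: generate exponential waiting times, count how many jumps land in \([0,t]\), and compare the average count to \(\lambda t\). A Python sketch with arbitrary \(\lambda\) and \(t\):

```python
import random

random.seed(4)

lam, t = 2.0, 3.0   # rate and time horizon (illustrative)

def n_jumps(lam, t):
    """Count how many exponential waiting times fit in [0, t]."""
    s, count = 0.0, 0
    while True:
        s += random.expovariate(lam)
        if s > t:
            return count
        count += 1

counts = [n_jumps(lam, t) for _ in range(50_000)]
mean_count = sum(counts) / len(counts)
print(round(mean_count, 1))  # close to lam * t = 6
assert abs(mean_count - lam * t) < 0.1
```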


Queues

A queue is a model for a list of individuals waiting to be served. Examples:

  • Emergency room beds
  • Vaccine inventory
  • HIV care pipeline

The simplest queue is a continuous-time Markov model with \(\lambda_{i,i+1}=\lambda\) and \(\lambda_{i,i-1}=\mu\) for \(i \ge 1\).

A slightly more complicated queue has \(\lambda_{i,i+1}=\lambda\) and \(\lambda_{i,i-1}=i\mu\), so each of the \(i\) individuals present is served at rate \(\mu\).

Queues are useful because they can be combined into a stochastic model for sequential movement through a series of compartments or states.


Example of a queueing model

Gonsalves et al. (2017)

Simulating continuous-time Markov models

Gillespie algorithm

The “Gillespie algorithm” performs forward simulation of a stochastic compartmental model (Pineda-Krch 2008). You don’t really need to know how it works. Forward simulation of continuous-time Markov models is computationally straightforward.

The ssa function in the GillespieSSA package implements a fast version of the Gillespie algorithm.
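For intuition (not needed for the course), the core loop is short enough to sketch. Here is a minimal Python version for a pure-birth process, where the only jump from state \(X\) is to \(X+1\) at total rate \(\lambda X\):

```python
import random

random.seed(5)

def gillespie_pure_birth(lam, x0, tf):
    """Minimal Gillespie simulation of a pure-birth process: each of the
    X individuals gives birth at rate lam, so the total jump rate from
    state X is lam * X."""
    t, x = 0.0, x0
    path = [(t, x)]
    while True:
        rate = lam * x                  # total rate out of the current state
        t += random.expovariate(rate)   # exponential waiting time to the jump
        if t > tf:
            break
        x += 1                          # the only possible jump is X -> X+1
        path.append((t, x))
    return path

path = gillespie_pure_birth(lam=1.0, x0=1, tf=5.0)
print(path[-1])   # (time of last jump, population size at tf)
```

In a model with several transitions, the only extra step is choosing the destination with probability proportional to its rate, independently of the waiting time.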


Example: pure birth

\(X(t)\) counts the number of individuals in a growing population. The transition rates are \(\lambda_{i,i+1}=i\lambda\).

library(GillespieSSA)
parms <- c(lam = 1)
x0 <- c(X = 1)
a <- c("lam*X")
nu <- matrix(1)
out <- ssa(x0, a, nu, parms, tf = 10, simName = "Pure birth")

Example: pure birth

plot(out$data[, 1], out$data[, 2], col = "red", 
    pch = 19, xlab = "Time", ylab = "Number", 
    bty = "n")


Explanation of ssa

ssa(x0, a, nu, parms, tf)

Here, x0 is the vector of starting states, a is the character vector of transition rates (propensity functions), nu is the state-change matrix giving the change in each compartment for each transition, parms is the named vector of parameters, and tf is the final time.

The function returns a data frame with jump times and states.

head(out$data)
##                     X
## timeSeries 0.000000 1
##            2.802968 2
##            3.783844 3
##            4.515612 4
##            4.563723 5
##            5.003653 6

Example: Stochastic logistic growth

parms <- c(b = 2, d = 1, K = 1000)
x0 <- c(N = 500)
a <- c("b*N", "(d+(b-d)*N/K)*N")
nu <- matrix(c(+1, -1), ncol = 2)
out <- ssa(x0, a, nu, parms, tf = 10, simName = "Logistic growth")

Example: Logistic growth

plot(out$data[, 1], out$data[, 2], type = "l", 
    xlab = "Time", ylab = "Number", bty = "n")
abline(h = 1000, lty = "dashed", col = "gray")


Example: SIR

parms <- c(beta = 0.005, gamma = 0.1)
x0 <- c(S = 499, I = 1, R = 0)
a <- c("beta*S*I", "gamma*I")
nu <- matrix(c(-1, 0, +1, -1, 0, +1), nrow = 3, 
    byrow = TRUE)
out <- ssa(x0, a, nu, parms, tf = 100, simName = "SIR model")

Example: SIR

plot(out$data[, 1], out$data[, 2], col = "green", 
    ylim = c(0, 500), type = "l", xlab = "Time", 
    ylab = "Number", bty = "n")
lines(out$data[, 1], out$data[, 3], col = "red")
lines(out$data[, 1], out$data[, 4], col = "blue")
legend(50, 500, c("S", "I", "R"), pch = 16, 
    col = c("green", "red", "blue"))

“Statistical” models

“Statistical models” are stochastic because they contain random variables. Consider the usual Normal/Gaussian linear model \[ y = \alpha + \beta x + \epsilon \] where \(\epsilon \sim N(0,\sigma^2)\). This model is stochastic because \(\epsilon\) is random.
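One way to see the stochasticity: simulate data from this model and re-estimate the parameters. A Python sketch with arbitrary true values:

```python
import random

random.seed(6)

# Simulate from y = alpha + beta * x + eps, with eps ~ N(0, sigma^2)
alpha, beta, sigma = 1.0, 2.0, 0.5
n = 10_000
xs = [random.uniform(0, 1) for _ in range(n)]
ys = [alpha + beta * x + random.gauss(0, sigma) for x in xs]

# Least-squares estimates recover the parameters, up to sampling noise
xbar = sum(xs) / n
ybar = sum(ys) / n
beta_hat = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
            / sum((x - xbar) ** 2 for x in xs))
alpha_hat = ybar - beta_hat * xbar
print(round(alpha_hat, 1), round(beta_hat, 1))  # close to 1.0 2.0
```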

In my opinion, you should think about so-called statistical models as mechanistic.

References

Bailey, Norman T. J. 1990. The Elements of Stochastic Processes with Applications to the Natural Sciences. Vol. 25. John Wiley & Sons.

Gonsalves, Gregg S, A David Paltiel, Paul D Cleary, Michael J Gill, Mari M Kitahata, Peter F Rebeiro, Michael J Silverberg, et al. 2017. “A Flow-Based Model of the HIV Care Continuum in the United States.” Journal of Acquired Immune Deficiency Syndromes 75 (5): 548–53.

Pineda-Krch, Mario. 2008. “GillespieSSA: Implementing the Stochastic Simulation Algorithm in R.” Journal of Statistical Software 25 (12): 1–18.

Renshaw, Eric. 2015. Stochastic Population Processes: Analysis, Approximations, Simulations. Oxford University Press.